Model-Based Robot Learning

Christopher G. Atkeson, Eric W. Aboaf, 
Joseph McIntyre and David J. Reinkensmeyer



1 Introduction 

An important component of human motor skill is the abil- 
ity to improve performance by practicing a task. 
Commands are refined on the basis of performance er- 
rors. It is often suggested that such learning reduces the 
need for an accurate internal model, a model of the me- 
chanical plant in the control system (see Arimoto, 1984b; 
Wang and Horowitz, 1985; and Harokopos, 1986 for ex- 
amples). This is not the case. Internal models play an 
important role in generating command corrections from 
performance errors. As an internal model is made more 
accurate, learning efficiency is improved, as is initial per- 
formance. 

This paper will show, in a series of examples, how 
internal models can be used as learning operators. The 
examples are 1) positioning a limb at a visual target, 2) 
throwing a ball at a target, and 3) following a defined 
trajectory. The essence of the model-based learning algo- 
rithms used to improve performance on these tasks is that 
internal models are used to transform performance errors 
into command corrections. 

The type of learning described in this paper - refining 
commands on the basis of practice - complements many 
other types of adaptive processes. Feedback controller de- 
signs can be improved by adaptive control algorithms. In- 
ternal models can be incrementally improved using system 
identification techniques. Trajectories can be optimized 
for particular tasks. Robot plans and programs can be 
debugged as errors are discovered during execution. This 
paper focuses on improving execution of a given task plan 
by refining the commands given to the robot. 



Model-Based Learning Algorithm Structure

The model-based learning algorithms described here all
have the same form. Commands are refined on the basis
of performance errors. A command is applied to the controlled
system (Figure 1A). Performance errors may result
from errors in the command. A model of the inverse of
the controlled system is used to estimate the errors in the
command based on the measured performance or output
errors (Figure 1B). If the inverse model of the controlled
system were perfect, the command errors would be correctly
estimated and completely eliminated after one attempt at
performing the task. (Of course, if a perfect model of the
controlled system were available, the initial command
would also have been perfect.) Perfect knowledge of the
controlled system is not usually available, and the model 
of the inverse of the controlled system will be incorrect. 
Due to the modeling errors, the command correction will 
be incomplete, and learning will be an iterative process of 
refining the command. 

There are three steps to the learning algorithms: com- 
mand initialization, execution, and modification. The ini- 
tial command is generated by applying the inverse model 
of the controlled system to the desired performance. Dur- 
ing execution, a command is applied to the system and 
the actual performance is monitored. The command cor- 
rection is calculated by applying the inverse model to the 
performance errors. The refined command is now exe- 
cuted. The cycle of command execution and modification 
is repeated until desired performance is achieved. 

Figure 1: The inverse of the controlled system is used to estimate command errors from performance errors.



2 A Kinematic Example 

The task of positioning the limb at a visual target will be 
used to provide a specific example of how model-based 
learning works. A robot arm and a target are viewed 
by a vision system (Figure 2). The robot arm servos
to a commanded set of joint angles, $\theta$, and the vision
system measures the tip position, $x$, in vision system
coordinates. The controlled system in this case transforms
commanded joint angles into a measured tip position
(Figure 1A):

$$x = L(\theta) \qquad (1)$$

The forward kinematics, L(), is in general a nonlinear 
transformation. For the purposes of this example we 
will assume there are no singularities or redundancies to 
resolve in the field of view of the vision system. For each 
desired tip position there is one and only one appropriate 
set of joint angles. 

A model of the inverse kinematics is used to transform
the desired tip position, $x_d$, into an initial joint angle
command, $\theta^0$, in the command initialization stage:

$$\theta^0 = \hat{L}^{-1}(x_d) \qquad (2)$$

A caret (^) is used to indicate a model or an estimate of
a quantity. The initial joint angle command is applied
in the first execution stage, and the corresponding tip
position is measured:

$$x^0 = L(\theta^0) \qquad (3)$$



The true system, $L()$, and its inverse are unknown, and
only imperfect models are available. Due to modeling
errors, the actual tip position, $x^0$, will not match the
desired tip position, $x_d$.

At this point we must decide how to transform the 
measured tip position error into a correction to the set 
of commanded joint angles. Performance errors must be 
mapped into command corrections. The same model of
the inverse kinematics that was used to generate the initial
command, $\hat{L}^{-1}()$, will be used to estimate the
command error (Figure 1B).

The command error, $\delta\theta^0$, is the difference between
the currently commanded joint angles, $\theta^0$, and the
(unknown) correct set of joint angles, which will be indicated
as $\theta^*$. The command error can be computed in terms
of the actual and desired performances using the true
system inverse:

$$\delta\theta^0 = \theta^0 - \theta^* = L^{-1}(x^0) - L^{-1}(x_d) \qquad (4)$$



As we do not have perfect knowledge of the true system
inverse, we must use a model of the system inverse to
estimate the command error:

$$\widehat{\delta\theta}^0 = \hat{L}^{-1}(x^0) - \hat{L}^{-1}(x_d) \qquad (5)$$



The command is updated by simply subtracting the estimate
of the command error from the previous command:

$$\theta^1 = \theta^0 - \widehat{\delta\theta}^0 \qquad (6)$$



If the model of the system inverse were perfect, the
command error would be estimated correctly and completely
eliminated on the next attempt. However, a
model is rarely perfect, so command correction must be 
an iterative process of estimating a command error using 
an imperfect model, removing the estimated command 
error, applying the refined command, and using the resulting
performance error and the model to estimate remaining
errors in the command. Equations (3), (5), and
(6) can be indexed with $i$ to indicate that they are applied
on each practice attempt, reflecting the iterative
nature of the algorithm:



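1. Command initialization:

   $$\theta^0 = \hat{L}^{-1}(x_d) \qquad (7)$$

2. Command execution:

   $$x^i = L(\theta^i) \qquad (8)$$

3. Command error estimation:

   $$\widehat{\delta\theta}^i = \hat{L}^{-1}(x^i) - \hat{L}^{-1}(x_d) \qquad (9)$$

4. Command modification:

   $$\theta^{i+1} = \theta^i - \widehat{\delta\theta}^i \qquad (10)$$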



Steps 2, 3, and 4 are repeated until satisfactory 
performance is achieved. 
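
As a minimal sketch of steps (7) through (10), consider a planar two-link arm with a closed-form inverse kinematics model. Everything specific here is an assumption made for illustration: the two-link geometry, the link lengths in `TRUE_LINKS` and `MODEL_LINKS`, the target, and the function names are hypothetical and are not taken from the experiments. The learner only ever calls the inverse model built from the wrong link lengths; the "true" arm is used solely to produce measurements.

```python
import math

def forward(theta, l1, l2):
    """Planar two-link forward kinematics: joint angles -> tip position."""
    t1, t2 = theta
    return (l1 * math.cos(t1) + l2 * math.cos(t1 + t2),
            l1 * math.sin(t1) + l2 * math.sin(t1 + t2))

def inverse(x, l1, l2):
    """Closed-form inverse kinematics, elbow branch with t2 in [0, pi]."""
    px, py = x
    c2 = (px * px + py * py - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # guards against round-off; targets assumed reachable
    t2 = math.acos(c2)
    t1 = math.atan2(py, px) - math.atan2(l2 * math.sin(t2), l1 + l2 * math.cos(t2))
    return (t1, t2)

TRUE_LINKS = (1.0, 0.8)   # hypothetical real arm, unknown to the learner
MODEL_LINKS = (0.9, 0.9)  # hypothetical imperfect model used as the learning operator

def learn(x_d, trials=10):
    """Refine the joint angle command by practice; returns (command, tip errors)."""
    ref = inverse(x_d, *MODEL_LINKS)
    theta = ref                                            # (7) initialization
    errors = []
    for _ in range(trials):
        x = forward(theta, *TRUE_LINKS)                    # (8) execution
        errors.append(math.hypot(x[0] - x_d[0], x[1] - x_d[1]))
        est = inverse(x, *MODEL_LINKS)
        d_theta = (est[0] - ref[0], est[1] - ref[1])       # (9) error estimation
        theta = (theta[0] - d_theta[0], theta[1] - d_theta[1])  # (10) modification
    return theta, errors
```

Calling `learn((1.0, 0.7))` returns the refined command together with the tip position error measured on each attempt, so the effect of the modeling error on the learning rate can be inspected directly.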

Figure 2: A robot arm and a target are viewed by a
vision system.



Convergence 

The quality of the inverse model used as the learning 
operator determines how fast model-based learning con- 
verges. Fixed point theory can be used to analyze the 
general nonlinear case (Wang 1984, Wang and Horowitz 
1985). A learning algorithm can be viewed as a mapping
of commands on the $i$th attempt to commands on
the next attempt:

$$\theta^{i+1} = F(\theta^i) \qquad (11)$$



The previously described algorithm can be put into this
form by substituting equation (8) into (9) and (9) into
(10). The model-based learning algorithm modifies the
$i$th command by adding a correction based on the performance
error transformed by the inverse model:

$$\theta^{i+1} = \theta^i - \left(\hat{L}^{-1}(L(\theta^i)) - \hat{L}^{-1}(x_d)\right) \qquad (12)$$



Note that when the desired performance, $x_d$, is achieved
using the correct command, $\theta^*$, then $L(\theta^*) = x_d$ and
equation (12) reduces to the fixed point $\theta^{i+1} = \theta^i = \theta^*$.
We can ask whether this fixed point is stable by
analyzing a linearization of equation (12) at the point
$(\theta, x) = (\theta^*, x_d)$. For a small perturbation $\delta\theta$ from the
fixed point,

$$L(\theta^* + \delta\theta) = x_d + J(\theta^*)\,\delta\theta \qquad (13)$$



where $J$ is the Jacobian matrix of derivatives of $L()$.
Similarly, for a small perturbation $\delta x$ from the fixed
point,

$$\hat{L}^{-1}(x_d + \delta x) = \hat{L}^{-1}(x_d) + \hat{J}^{-1}(x_d)\,\delta x \qquad (14)$$

where $\hat{J}^{-1}$ is the Jacobian matrix for the inverse model
$\hat{L}^{-1}()$. If on the $i$th trial the command is perturbed
from $\theta^*$ by $\delta\theta^i$ so that $\theta^i = \theta^* + \delta\theta^i$, the error in the
next command, $\delta\theta^{i+1} = \theta^{i+1} - \theta^*$, can be computed by
substituting equations (13) and (14) into equation (12):

$$\delta\theta^{i+1} = \left(1 - \hat{J}^{-1}(x_d)\,J(\theta^*)\right)\delta\theta^i \qquad (15)$$

If $\hat{J}^{-1}$ is a correct inverse of $J$, the command error will
be completely corrected after one attempt, in the linear
case. The command error $\delta\theta$ will decrease when all of
the eigenvalues of the matrix $(1 - \hat{J}^{-1}J)$ are less than one
in absolute value, with the rate of decrease determined
by the magnitude of the eigenvalues. If the magnitude of
any eigenvalue is greater than one, the learning process
will be unstable and performance will be degraded rather than
improved by learning. The magnitudes of the eigenvalues
of $(1 - \hat{J}^{-1}J)$ depend on how accurately $\hat{J}^{-1}$ inverts $J$,
and thus the convergence rate of the learning algorithm
depends on how closely the learning operator inverts the
controlled system.
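
The role of the factor in equation (15) is easiest to see in a scalar case. The sketch below assumes a hypothetical linear plant $x = a\,\theta$ and an inverse model $\hat{L}^{-1}(x) = x/\hat{a}$, so that $(1 - \hat{J}^{-1}J)$ reduces to the number $1 - a/\hat{a}$. The gains and function names are illustrative assumptions, not values from the paper.

```python
def command_errors(a_true, a_model, x_d=1.0, trials=6):
    """Run model-based learning on the scalar plant x = a_true * theta.

    Returns the command error theta_i - theta_star on each trial.
    """
    theta_star = x_d / a_true      # correct command (unknown to the learner)
    theta = x_d / a_model          # initial command from the inverse model
    errors = []
    for _ in range(trials):
        errors.append(theta - theta_star)
        x = a_true * theta                      # execute the command
        d_theta = x / a_model - x_d / a_model   # estimated command error
        theta -= d_theta                        # refine the command
    return errors

def contraction_factor(a_true, a_model):
    """Scalar version of (1 - J_hat^{-1} J) from equation (15)."""
    return 1.0 - a_true / a_model
```

With `a_true = 2.0` and `a_model = 2.5` the factor is 0.2, so each practice attempt removes most of the remaining command error. With `a_model = 0.8` the factor is -1.5, its magnitude exceeds one, and the command error grows and alternates in sign on each attempt.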

Input vs. output disturbance estimation 

Although our performance errors are due to errors in 
modeling the controlled system, the model-based learn- 
ing algorithm was derived by assuming that an unknown 
error was added to the command. In the kinematic tip 
positioning example a constant command disturbance 
would correspond to constant joint angle offsets added 
to the commanded joint angles. The learning algorithm 
just described can be viewed as an iterative procedure 
to estimate a command disturbance. 

An alternative version of the model-based learning
algorithm is suggested by assuming that the major sources
of error are output (performance) disturbances rather
than input (command) disturbances. In the kinematic
example just presented, the camera measuring tip position
could have an unknown offset, $\Delta$. This offset could
initially be assumed to be zero, and after each positioning
attempt an estimate of the offset could be refined by
subtracting the tip position error:

$$\hat{\Delta}^i = \hat{\Delta}^{i-1} - (x^{i-1} - x_d) \qquad (16)$$



The estimated output offset would be added to the desired
tip position when the next joint angle command
was computed:

$$\theta^i = \hat{L}^{-1}(x_d + \hat{\Delta}^i) \qquad (17)$$



Equations (16) and (17) replace equations (9) and (10) in 
the input disturbance version of the model-based learn- 
ing algorithm to form the output disturbance version. 

Representing possible modeling errors as either in- 
put or output errors is a modeling decision that depends 
on the assumed source of the modeling errors. In the 
output disturbance version of the model-based learning 
algorithm, as in the input disturbance version, the per- 
formance error is mapped through an inverse model of 
the controlled system to calculate a command correction. 
The output disturbance model-based learning algorithm
has convergence properties similar to those of the input
disturbance algorithm.
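
The similarity can be made explicit by repeating the linearization used for equation (15). The following is a sketch under the same small-perturbation assumptions; $\Delta^*$ (the offset estimate at the fixed point) and $\delta\Delta^i$ (its perturbation on trial $i$) are symbols introduced here for illustration.

```latex
\begin{align*}
\theta^{i} &= \hat{L}^{-1}(x_d + \hat{\Delta}^{i}), \qquad
x^{i} = L(\theta^{i}), \qquad
\hat{\Delta}^{i+1} = \hat{\Delta}^{i} - (x^{i} - x_d) \\
\text{fixed point:}\quad & L\big(\hat{L}^{-1}(x_d + \Delta^{*})\big) = x_d \\
\hat{\Delta}^{i} = \Delta^{*} + \delta\Delta^{i}
\;\Rightarrow\;
x^{i} - x_d &\approx J(\theta^{*})\,\hat{J}^{-1}(x_d + \Delta^{*})\,\delta\Delta^{i} \\
\delta\Delta^{i+1} &\approx \big(1 - J\,\hat{J}^{-1}\big)\,\delta\Delta^{i}
\end{align*}
```

For square matrices, $J\hat{J}^{-1}$ and $\hat{J}^{-1}J$ have the same eigenvalues, so the stability condition has the same form as in equation (15). The difference is that the inverse model Jacobian is evaluated at $x_d + \Delta^*$ rather than at $x_d$.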

Figure 3: The throwing task.



3 Learning to Throw 

Model-based learning can be used to improve performance
on a complete task, in addition to improving positioning.
As an example of task-level learning, a robot
arm was programmed to throw balls at a target. The
robot's throwing accuracy improved with practice.

Figure 3 illustrates the apparatus used in the throw- 
ing experiments. The target was at the center of a large 
metal plate, which was placed approximately 5 meters 
from the base of the robot. For this throwing task only 
the height of the ball when it hit the target plate was 
monitored and improved by a learning algorithm. 

The last link of a three-joint direct drive arm was
used as a catapult to throw a ball. The robot was positioned
so that the last link of the arm rotated in a
vertical plane. The last joint was servoed to a fifth-order
polynomial trajectory that began at rest at 225° and
ended at rest at 45°. A 4 cm diameter rubber ball was
placed onto a 3.5 cm diameter hole at the end of the last
link. The ball left the hole as the robot arm decelerated
during the throw. No release mechanism was used. The 
release position of the ball was assumed to be when the 
last link was at 135°. The distance the ball was thrown 
was controlled by changing the duration of the throw- 
ing movement, which changed the release velocity. A 
shorter duration and therefore faster movement threw 
the ball higher and further, and a longer duration move- 
ment threw the ball lower and closer. 

A video camera was used to record where the ball hit 
the target plate. The impact of the ball was sensed by
a force sensor on which the target plate was mounted.
This signal was used to choose video frames to be stored 
for later analysis. After the throw, the location of the 
ball on the target plate was manually measured from the 
appropriate video frame. 

The initial release velocity command was calculated 
by measuring the distance to the target and using a sim- 
ple ballistics model, incorporating only gravity, to pre- 
dict the required flight trajectory given the assumed re- 
lease position and initial direction of ball flight. The cor- 
responding trajectory duration was computed and the 
calculated trajectory executed. On the first throw the
ball hit the target plate 28 cm above the target. The
model-based learning algorithm based on estimating an 
output offset (equations (16) and (17)) was used to im- 
prove performance on the throwing task. This output 
offset learning algorithm corresponds to our intuition 
that we should aim lower if we are hitting too high, and 
vice versa. The role of the internal model is to calcu- 
late how much the aim should be changed. The bal- 
listics model used to generate initial performance was 
also used to calculate the appropriate release velocity as 
the aim was offset by the estimated disturbance amount. 
The open squares in Figure 4 show the throwing perfor- 
mance during model-based learning. In this particular 
experiment the ball hit the target on the eighth throw. 
The open triangles in Figure 4 indicate the perfor- 
mance of a model-based learning algorithm that improves 
the model as well as refining the command. This algo- 
rithm will be discussed in a later paper. 
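
The output offset procedure of equations (16) and (17) can be sketched for a throw as follows. This is an illustration only, not the experimental code: the 45 degree release direction, the flat launch-to-plate geometry, and the stand-in for the real system (a ball that leaves 3% faster than commanded) are all made-up assumptions, and the sketch commands release speed directly rather than movement duration. Only the gravity-only ballistics model and the roughly 5 meter distance come from the text.

```python
import math

G = 9.81                     # gravitational acceleration, m/s^2
DIST = 5.0                   # horizontal distance to the plate, m (approximate)
ANGLE = math.radians(45.0)   # assumed release direction (hypothetical)

def model_height(v):
    """Gravity-only ballistics model: impact height relative to release height."""
    return DIST * math.tan(ANGLE) - G * DIST**2 / (2.0 * v**2 * math.cos(ANGLE)**2)

def model_release_speed(h):
    """Inverse of model_height: release speed predicted to hit height h."""
    return math.sqrt(G * DIST**2 /
                     (2.0 * math.cos(ANGLE)**2 * (DIST * math.tan(ANGLE) - h)))

def real_throw(v):
    """Stand-in for the real system: ball leaves 3% faster than commanded (made up)."""
    return model_height(1.03 * v)

def practice(h_d=0.0, throws=8):
    """Output offset learning; returns the height error on each throw."""
    offset = 0.0                                # offset estimate, initially zero
    errors = []
    for _ in range(throws):
        v = model_release_speed(h_d + offset)   # (17): aim at the offset target
        err = real_throw(v) - h_d               # measured height error
        errors.append(err)
        offset -= err                           # (16): refine the offset estimate
    return errors
```

Hitting too high produces a negative offset, so the next throw is aimed lower; the ballistics model converts that change of aim into a change of release speed. Because the stand-in system is much simpler than a real arm and ball, the number of throws needed here says nothing about the experimental results in Figure 4.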

Figure 4: Performance of the model-based learning algorithm
on a throwing task.



4 Trajectory Learning 

Trajectory execution of a robot can be improved using a 
model-based learning algorithm (Atkeson and McIntyre
1986a, 1986b). A model of the robot inverse dynamics is 
used as the learning operator that transforms trajectory 
following errors into feedforward command corrections. 
This form of learning is useful for refining repetitive mo- 
tions, and can also be used to refine groups of similar mo- 
tions. Model-based trajectory learning was implemented 
on the MIT Serial Link Direct Drive Arm and greatly 
reduced trajectory following errors in a small number of 
practice movements. 

The robot model used as the learning operator in 
the trajectory learning experiments was identified from 
movements of the MIT Serial Link Direct Drive Robot 
Arm (Atkeson, An, and Hollerbach 1986). The dynamics 
of this direct drive robot arm are dominated by rigid 
body dynamics, so a Newton-Euler model structure was 
used. The Newton-Euler rigid body dynamics equations 
for a robot can be written as 

$$\tau = R^{-1}(\theta, \dot{\theta}, \ddot{\theta}) = I(\theta)\cdot\ddot{\theta} + \dot{\theta}\cdot C(\theta)\cdot\dot{\theta} + g(\theta) \qquad (18)$$

where $\theta(t)$ is the desired trajectory of the joint angles,
$\tau(t)$ is the vector of required torques to achieve the desired
trajectory, $I(\theta)$ is the inertia matrix of the arm,
$C(\theta)$ is the Coriolis and centripetal force tensor, and
$g(\theta)$ is the gravitational force vector (Hollerbach, 1984).
For other types of robots it is argued that additional 
sources of dynamics are important (Goor, 1985; Good, 
Sweet, and Strobel, 1985). In these cases we can still 
model the robot dynamics and invert the model. 

As before, there are several stages of the algorithm.
The initial feedforward command is generated by applying
the model of the robot inverse dynamics to the
desired trajectory (as in equation (7)):

$$\tau_{ff}^{0}(t) = \hat{R}^{-1}(\theta_d(t), \dot{\theta}_d(t), \ddot{\theta}_d(t)) \qquad (19)$$

During command execution the applied command is
the sum of the feedforward command, $\tau_{ff}$, and the output
of the feedback controller, $\tau_{fb}$:

$$\tau^{i}(t) = \tau_{ff}^{i}(t) + \tau_{fb}^{i}(t) \qquad (20)$$

The total applied command, $\tau$, is used as the basis
for the next feedforward command. As described in the
previous sections, the command error is estimated using
the model of the robot inverse dynamics (as in equation (9)):

$$\widehat{\delta\tau}^{i}(t) = \hat{R}^{-1}(\theta^{i}(t), \dot{\theta}^{i}(t), \ddot{\theta}^{i}(t)) - \hat{R}^{-1}(\theta_d(t), \dot{\theta}_d(t), \ddot{\theta}_d(t)) \qquad (21)$$

and the next feedforward command is the modified total
command (as in equation (10)):

$$\tau_{ff}^{i+1}(t) = \tau^{i}(t) - \widehat{\delta\tau}^{i}(t) \qquad (22)$$
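
The stages (19) through (22) can be sketched on a simulated single joint. The plant, its parameters, the feedback gains, and the integration step below are hypothetical values chosen for illustration: the "true" joint has inertia and viscous friction, the model knows only a wrong inertia, and accelerations are read directly from the simulation rather than estimated from filtered velocity data. This is a simplified stand-in, not the three-joint implementation described in this paper.

```python
def quintic(t, T, start, end):
    """Fifth-order polynomial with zero end velocities and accelerations.

    Returns (position, velocity, acceleration) at time t.
    """
    s = min(max(t / T, 0.0), 1.0)
    d = end - start
    pos = start + d * (10 * s**3 - 15 * s**4 + 6 * s**5)
    vel = d * (30 * s**2 - 60 * s**3 + 30 * s**4) / T
    acc = d * (60 * s - 180 * s**2 + 120 * s**3) / T**2
    return pos, vel, acc

I_TRUE, B_TRUE = 1.0, 0.5  # hypothetical true inertia and viscous friction
I_MODEL = 0.8              # hypothetical model: wrong inertia, no friction term
KP, KD = 100.0, 20.0       # hypothetical feedback gains, fixed during learning
DT, T = 0.001, 1.5         # integration step and movement duration, seconds

def run_trials(n_trials=4, start=0.5, end=4.5):
    """Returns the RMS velocity error of each practice movement."""
    steps = int(round(T / DT))
    ref = [quintic(k * DT, T, start, end) for k in range(steps)]
    ff = [I_MODEL * acc_d for (_, _, acc_d) in ref]          # (19)
    rms_history = []
    for _ in range(n_trials):
        pos, vel = start, 0.0
        next_ff, sq = [], 0.0
        for k, (pos_d, vel_d, acc_d) in enumerate(ref):
            fb = KP * (pos_d - pos) + KD * (vel_d - vel)     # feedback controller
            tau = ff[k] + fb                                 # (20)
            acc = (tau - B_TRUE * vel) / I_TRUE              # simulated true plant
            d_tau = I_MODEL * acc - I_MODEL * acc_d          # (21)
            next_ff.append(tau - d_tau)                      # (22)
            sq += (vel - vel_d) ** 2
            vel += acc * DT                                  # Euler integration
            pos += vel * DT
        ff = next_ff
        rms_history.append((sq / steps) ** 0.5)
    return rms_history
```

`run_trials()` returns one RMS velocity error per movement, so the reduction in trajectory following error from the first movement to the fourth can be read off directly for this simulated joint.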



Other Approaches to Trajectory Learning

Recent work in a number of laboratories has focused on 
how to refine feedforward commands for repetitive move- 
ments on the basis of previous movement errors. Work 
on repeated trajectory learning includes (Arimoto et al 
1984, 1985; Casalino & Gambardella 1986; Craig 1984; 
Furuta & Yamakita 1986; Hara et al 1985; Harokopos 
1986; Mita & Kato 1985; Morita 1986; Togai & Yamano 
1986; Uchiyama 1978; Wang 1984; Wang & Horowitz 
1985). These papers discuss only linear learning oper- 
ators and emphasize the stability of the proposed algo- 
rithms. There has been little work emphasizing perfor- 
mance, i.e. the convergence rate of the algorithm. Simu- 
lations of several of these algorithms have revealed very 
slow convergence and large sensitivity to disturbances 
and sensor and actuator noise (C. G. Atkeson, unpub- 
lished results). 

An Implementation of the Trajectory 
Learning Algorithm 

The model-based trajectory learning algorithm has been 
implemented on the MIT Serial Link Direct Drive Arm 
(Atkeson and McIntyre 1986a, 1986b). This three joint
arm is described in (Atkeson, An, and Hollerbach 1986). 
To explore the effectiveness of the model-based trajec- 
tory learning algorithm we will present results on learn- 
ing a particular trajectory. 

The Test Trajectory: All three joints of the Direct 
Drive Arm were commanded to follow a fifth order poly- 
nomial trajectory with zero initial and final velocities and 
accelerations and a 1.5 second duration. Figure 5 shows 
the shape of the trajectory for each joint, and Table 1 
gives the initial and final joint positions, the peak joint 
velocities, and the peak joint accelerations. 
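
For a fifth order polynomial with zero initial and final velocities and accelerations, the peak velocity and peak acceleration follow in closed form from the displacement and the duration. The sketch below uses those standard closed-form factors (15/8 for velocity at mid-movement, 10/sqrt(3) for acceleration); the function name is a hypothetical choice, and the end positions and the 1.5 second duration are the ones given for the test trajectory.

```python
import math

def quintic_peaks(start, end, duration):
    """Peak velocity (signed) and peak acceleration magnitude of a quintic move.

    Assumes zero velocity and acceleration at both ends of the movement.
    """
    d = end - start
    peak_velocity = 15.0 / 8.0 * d / duration
    peak_acceleration = 10.0 / math.sqrt(3.0) * abs(d) / duration**2
    return peak_velocity, peak_acceleration

# Initial and final positions (radians) of the three joints, 1.5 s duration.
JOINT_MOVES = [(0.5, 4.5), (5.0, 1.0), (4.0, -0.5)]
PEAKS = [quintic_peaks(start, end, 1.5) for start, end in JOINT_MOVES]
```

The entries of `PEAKS` agree with the peak velocities and accelerations listed in Table 1 to the precision given there, which is a useful consistency check on the trajectory parameters.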



Figure 5: The test trajectory (position and velocity panels; time scale: 0.5 seconds).





         Initial     Final       Peak          Peak
         Position    Position    Velocity      Acceleration
Joint    (radians)   (radians)   (radians/s)   (radians/s^2)
1         0.5         4.5         5.0          ±10.3
2         5.0         1.0        -5.0          ±10.3
3         4.0        -0.5        -5.6          ±11.5

Table 1: Test trajectory parameters.



The Feedback Controller: An independent digital 
feedback controller was implemented for each joint and 
was not modified during learning. 

Initialization Of The Feedforward Command: 
The initial feedforward torques were generated from a 
rigid body dynamics model. The model and the estima- 
tion of its parameters are described in (Atkeson, An, and 
Hollerbach, 1986). The calculated feedforward torques 
are shown in Figure 6A. 

Initial Trajectory Performance: As an index of 
trajectory following performance the velocity errors (the 
difference between the actual joint velocity and the de- 
sired joint velocity) for the first movement are shown in 
Figure 7A. We have plotted the raw velocity error data 
to give an idea of the relative size of the trajectory errors 
and sensor noise. 

Calculating Acceleration and Filtering: In or- 
der to use the rigid body inverse dynamics model to com- 
pute joint torques it was necessary to compute the joint 
accelerations. Joint positions and velocities were mea- 
sured directly. A digital differentiating filter combined 
with an 8Hz low pass filter was applied to the velocity 
data to estimate accelerations. 

To reject noise and non-repeatable disturbances and 
to compensate for high frequency unmodelled dynam- 
ics it was necessary to filter the trajectory errors and 
controller output. In this implementation we applied 
low pass digital filters with an 8 Hz cutoff to the data
used in the learning process. We filtered the references
used by the learning operator with the same filter used 
on the data. It was also necessary to correct for incon- 
sistencies between the velocity sensors and the position 
measurements, which was done by adjusting the position 
reference to the feedback controller until the integrated 
velocity error matched the position error. 

Final Trajectory Performance: The robot exe- 
cuted two additional training movements which are not 
shown, and its performance on the fourth attempt of the 
test trajectory was assessed. Figure 6B shows the mod- 
ified feedforward commands used on the fourth move- 
ment, and should be compared with the predicted tor- 
ques shown in Figure 6A. Figure 7B shows the velocity 
errors for the fourth movement, and should be compared 
with the initial movement velocity errors in Figure 7A. 
There has been a substantial reduction in trajectory fol- 
lowing error after only three practice movements. 

5 Issues For Further Research 

Some of the questions that warrant further research in- 
clude the effect of modeling errors and non-repeatable 
disturbances on convergence, and learning of non-repeti- 
tive tasks. 

As discussed previously, the convergence of model- 
based learning algorithms depends on the quality of the 
model. Accurate models support efficient learning. Inac- 
curate models may cause learning algorithms to degrade 
performance rather than improve it. 

Reducing or filtering the estimated command correc- 
tion will make model-based learning more robust to mod- 
eling errors. Convergence will be slowed, however. Fur- 
ther research is required into the appropriate tradeoff 
between handling modeling errors and fast convergence. 
Filtering of the model-based command update also plays 
an important role in reducing the effect of non-repeatable 
disturbances. 
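
The tradeoff can be illustrated by scaling the estimated command correction with a gain before it is applied. The sketch below again assumes a hypothetical scalar plant $x = a\,\theta$ with inverse model $x/\hat{a}$; the `gain` parameter and all numerical values are illustrative assumptions. With the scaled update the error factor becomes $1 - \text{gain}\cdot a/\hat{a}$.

```python
def final_error(a_true, a_model, gain, x_d=1.0, trials=20):
    """Command error magnitude after practice with a scaled correction."""
    theta_star = x_d / a_true          # correct command (unknown to the learner)
    theta = x_d / a_model              # initial command from the inverse model
    for _ in range(trials):
        x = a_true * theta                  # execute the command
        d_theta = (x - x_d) / a_model       # model-based command error estimate
        theta -= gain * d_theta             # apply only a fraction of it
    return abs(theta - theta_star)
```

With a poor model (`a_true = 2.0`, `a_model = 0.8`) the full correction (`gain = 1.0`) diverges, while `gain = 0.5` converges. With a better model (`a_model = 2.5`) the same reduced gain converges more slowly than the full correction, which is the cost of the added robustness.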

If intermediate sensory signals are available, then 
breaking the control system into modules and having 
each module learn independently may improve learning 
performance. We plan to explore this issue in the throw- 
ing task. If measurements are available of when and 
where the ball is released, then independent models of 
the throwing motion and the ball flight characteristics 
can be made. These independent models can be used to 
choose an appropriate release velocity separately from 
refining the trajectory that attains that release velocity. 

It is possible to modify models as well as commands 
during learning. In the examples presented in this pa- 
per the same models were used repeatedly even after it 
became clear during learning that the models had large 
errors. We have explored some methods of model refine- 
ment during practice. The open triangles of Figure 4 
show the faster convergence of a model-based learning 
algorithm that improves the model as well as the com- 
mand. 

The model-based learning algorithms are ideally suited
to refining repetitive commands for the same tasks.
The learning algorithms can also be applied to refining 
commands for different tasks by assuming that similar 
command errors will be made on similar tasks. An es- 
timate of the command error on one task will be useful 
for improving the command for other tasks that share 
features with the original task. 

6 Conclusion 

The main message of this paper is that models play an 
important role in learning from practice. Better models 
lead to faster correction of command errors. The incor- 
poration of learning in a control system is not a license 
to do a poor modeling job of the controlled system. The 
benefits of accurate modeling are better performance in 
all aspects of control, while the risks of inadequate mod- 
eling are poor learning performance or even degradation 
of performance with practice. 

The approach to robot learning presented here is 
based on explicit modeling of the robot and the task 
being performed. An inverse model of the task is used 
as the learning operator that processes the errors. Such 
model-based command refinement algorithms usefully 
complement other approaches to adaptive control. 

Studying model-based learning algorithms serves two 
purposes: 1) to improve robot performance, and 2) to 
increase our understanding of the role of practice and 
internal models in human motor learning. 